About the project

This course teaches students to use the statistical programming software R to prepare and analyze data.

Regression and model validation

This week I have learned to set up, perform and interpret simple linear regression with multiple explanatory variables.

lrn14<-read.csv("learning2014.csv",row.names=1)
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166   7

This dataset includes 166 rows and 7 columns (variables). The variables are Age, Gender, (exam) Points, Attitude, Surface Learning, Strategic Learning and Deep Learning.

install.packages("GGally")
install.packages("ggplot2")

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
pairs(lrn14[-1],col=lrn14$gender)

ggpairs(lrn14, mapping=aes(col=gender, alpha=0.3), lower=list(combo=wrap("facethist",bins=20)))

summary(lrn14)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Most of the variables have relatively normal distributions, with the exception of age, which has a right-skewed distribution. The male attitude distribution could also be considered left-skewed. Most of the variables have weak relationships with each other, as exhibited by correlations less than 0.1. Notably, points and attitude have a strong positive relationship with a correlation value of 0.437. Deep learning and surface learning have a fairly strong negative correlation at -0.324, meaning that as deep learning increases, the tendency toward surface learning decreases. Surprisingly, median deep learning is higher than median surface learning, although there is higher variation with regard to deep learning.

my_model<-lm(Points~attitude+deep+surf, data=lrn14)
summary(my_model)
## 
## Call:
## lm(formula = Points ~ attitude + deep + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

The only tested explanatory variable that exhibits a statistically significant relationship with the target variable Points is attitude, as indicated by a p-value well below 0.05. For every 1-point increase in a student's measured attitude, the model predicts a 3.47-point increase in exam points (the coefficient estimate; 6.011 is the t-value, not the effect size). The explanatory variables deep learning and surface learning show no relationship with exam points that is significantly different from 0.

Ordinary least squares regression determines the betas, the coefficients of the best-fit line, by minimizing the sum of squared differences between the fitted values and the observed values. The coefficient estimates represent the slopes with respect to each explanatory variable, and thus the relationship between that explanatory variable and the target variable. Each t-value is the estimate divided by its standard error, and its p-value indicates whether that relationship is statistically significant or plausibly due to chance.
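As a sketch of what lm() computes under the hood (assuming lrn14 has been read in as above), the OLS betas can be obtained directly in matrix form:

```r
# OLS in matrix form: beta_hat = (X'X)^{-1} X'y
# (a sketch; assumes lrn14 has been read in as above)
X <- model.matrix(Points ~ attitude + deep + surf, data = lrn14)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% lrn14$Points
beta_hat  # should match coef(lm(Points ~ attitude + deep + surf, data = lrn14))
```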

my_model2<-lm(Points~attitude,data=lrn14)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Removing the explanatory variables surface learning and deep learning, which did not exhibit statistically significant relationships with the target variable Points, attitude remains statistically significant and its estimated effect on Points grows slightly (from 3.466 to 3.526). The intercept estimate falls from 18.36 to 11.64, although its t-value rises from 3.895 to 6.358. The median residual increases slightly from the prior model, while the multiple R-squared falls from 0.2024 to 0.1906, meaning that the second model has slightly less explanatory power than the first.

R-squared is a measure of how close the true observations are to the fitted line, and thus a determination of how well the model explains the variation of observations around the data's mean. It tells you how much of the variance in the target variable can be explained by the explanatory variables. R-squared values range from 0 to 1, with 0 indicating that the model explains none of the variation, and 1 indicating that the explanatory variables used explain all of the variation in the target variable. My first model has an R-squared of 0.2024, meaning that it has relatively low explanatory power and there are substantial differences between the fitted values and the actual observations.
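The definition above can be verified by hand (a sketch, assuming lrn14 is loaded as above):

```r
# R-squared by hand: 1 - RSS/TSS (sketch; assumes lrn14 as above)
fit <- lm(Points ~ attitude + deep + surf, data = lrn14)
rss <- sum(residuals(fit)^2)                        # residual sum of squares
tss <- sum((lrn14$Points - mean(lrn14$Points))^2)   # total sum of squares
1 - rss / tss  # should match the Multiple R-squared of 0.2024
```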

plot(my_model2, which=c(1:2,5))

The assumptions of the linear regression model are that the errors are normally distributed, the errors are not correlated and the errors have constant variance (and, as such, the size of an error does not depend on the explanatory variables). For my model, the normality assumption is reasonable, because the points are for the most part tightly fitted to the line in the Q-Q plot and there are no extreme outliers. The constant variance assumption is also reasonable, because there is a random spread of points with no clear pattern in the graph comparing residuals and model predictions. The Residuals vs. Leverage plot indicates that the data has regular leverage, because there is no significant outlier (the leverage values range from 0 to 0.04) that has a disproportionate effect on the model.


Logistic Regression

Introduction to logistic regression and cross validation

alc <- read.csv("alc.csv",row.names=1)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 382  35

The data includes 382 rows (students) and 35 variables/columns. Each variable indicates attributes and grade information about students in either math or Portuguese language courses.

I will test Pstatus, studytime, failures and higher in relation to high/low alcohol consumption. For those students with Pstatus apart, there will be a higher probability of high alcohol consumption. For those students with higher weekly study times, there will be a lower probability of high alcohol consumption. For those students with a higher number of failures, there will be a higher probability of high alcohol consumption. For those students who want to pursue higher education, there will be a lower probability of high alcohol consumption.

install.packages("GGally")
install.packages("ggplot2")

library(ggplot2)
library(GGally)
bar<-ggplot(data=alc, aes(x=high_use))
bar+facet_wrap("Pstatus")+geom_bar()

bar+facet_wrap("studytime")+geom_bar()

bar+facet_wrap("failures")+geom_bar()

bar+facet_wrap("higher")+geom_bar()

table(high_use=alc$high_use, Pstatus=alc$Pstatus)
##         Pstatus
## high_use   A   T
##    FALSE  26 242
##    TRUE   12 102
table(high_use=alc$high_use, studytime=alc$studytime)
##         studytime
## high_use   1   2   3   4
##    FALSE  58 135  52  23
##    TRUE   42  60   8   4
table(high_use=alc$high_use, failures=alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 244  12  10   2
##    TRUE   90  12   9   3
table(high_use=alc$high_use, higher=alc$higher)
##         higher
## high_use  no yes
##    FALSE   9 259
##    TRUE    9 105

Based on the bar graphs and cross-tabulations, my hypotheses generally seem to be appropriate. For students whose parents live together, a higher proportion have low alcohol consumption compared with students whose parents are separated. Generally, the more students study the more likely they are to have low alcohol consumption. For students who plan to pursue higher education, a higher proportion have low alcohol consumption compared to students who don’t plan to pursue higher education. In terms of failures, however, there is a less clear relationship with alcohol consumption.
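Because the groups differ in size, column proportions make these comparisons easier to read than the raw counts (a sketch, assuming alc is loaded as above):

```r
# Share of high_use within each parental-status group
# (sketch; assumes alc has been read in as above)
prop.table(table(high_use = alc$high_use, Pstatus = alc$Pstatus), margin = 2)
# from the counts above: 12/38 ~ 0.32 high use when apart vs 102/344 ~ 0.30 when together
```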

model<-glm(high_use~Pstatus+studytime+failures+higher, data=alc, family="binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ Pstatus + studytime + failures + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5570  -0.7981  -0.7981   1.3000   2.0759  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.41983    0.68428   0.614 0.539520    
## PstatusT    -0.08093    0.37705  -0.215 0.830043    
## studytime   -0.52540    0.15683  -3.350 0.000808 ***
## failures     0.34848    0.19764   1.763 0.077866 .  
## higheryes   -0.26871    0.53569  -0.502 0.615936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 443.89  on 377  degrees of freedom
## AIC: 453.89
## 
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept)    PstatusT   studytime    failures   higheryes 
##  0.41983005 -0.08093225 -0.52540334  0.34848048 -0.26871220

The only statistically significant predictor of high alcohol consumption (at the 0.001 level) is study time. For every one-unit increase in study time, a student's log-odds of exhibiting high alcohol consumption decrease by 0.52540.
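Exponentiating a logistic regression coefficient turns this additive change in log-odds into a multiplicative change in the odds themselves:

```r
# From log-odds to odds: each extra unit of study time multiplies the
# odds of high consumption by exp(coefficient)
exp(-0.52540)  # about 0.59, i.e. the odds fall by roughly 41 percent per unit
```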

install.packages("tidyr")

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.2
OR<-coef(model)%>%exp
CI<-exp(confint(model))
## Waiting for profiling to be done...
cbind(OR,CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 1.5217029 0.3876426 5.7868685
## PstatusT    0.9222562 0.4488646 1.9917762
## studytime   0.5913168 0.4308370 0.7978098
## failures    1.4169129 0.9625861 2.1018616
## higheryes   0.7643632 0.2675755 2.2483346

PstatusT is associated with almost 1:1 odds of high alcohol consumption (0.92), but the odds of success are slightly lower for students whose parents are together, meaning that PstatusT is negatively associated with success (high alcohol consumption). Study time is associated with approximately 3:5 odds, indicating that those students who study more have lower odds of exhibiting high alcohol consumption. Failures is associated with approximately 3:2 odds, meaning the odds of success are higher for those students with more failures and thus failure is positively associated with high alcohol consumption. Higher education exhibits roughly 3:4 odds, indicating that the desire to pursue higher education is negatively associated with high alcohol consumption. However, only the confidence interval for study time (0.43 to 0.80) excludes 1.0; the intervals for Pstatus, failures and higher all contain 1.0, so study time is the only explanatory variable whose association with the target variable (high alcohol consumption) is statistically significant at the p<0.05 level.

Overall, the relationships posited in my hypothesis were correct, but only study time exhibited a statistically significant relationship with alcohol consumption in the regression model.

model2<-glm(high_use~studytime, data=alc, family="binomial")
summary(model2)
## 
## Call:
## glm(formula = high_use ~ studytime, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0603  -0.8314  -0.8314   1.2993   2.1010  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.3209     0.3076   1.043    0.297    
## studytime    -0.6029     0.1530  -3.941  8.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 448.31  on 380  degrees of freedom
## AIC: 452.31
## 
## Number of Fisher Scoring iterations: 4
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
probabilities<-predict(model2, type="response")
alc<-mutate(alc, probability=probabilities)
alc<-mutate(alc, prediction=probability>0.5)
table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   268
##    TRUE    114
select(alc, studytime, high_use, probability, prediction) %>% head(20)
##    studytime high_use probability prediction
## 1          2    FALSE   0.2921823      FALSE
## 2          2    FALSE   0.2921823      FALSE
## 3          2     TRUE   0.2921823      FALSE
## 4          3    FALSE   0.1842725      FALSE
## 5          2    FALSE   0.2921823      FALSE
## 6          2    FALSE   0.2921823      FALSE
## 7          2    FALSE   0.2921823      FALSE
## 8          2    FALSE   0.2921823      FALSE
## 9          2    FALSE   0.2921823      FALSE
## 10         2    FALSE   0.2921823      FALSE
## 11         2    FALSE   0.2921823      FALSE
## 12         3    FALSE   0.1842725      FALSE
## 13         1    FALSE   0.4299752      FALSE
## 14         2    FALSE   0.2921823      FALSE
## 15         3    FALSE   0.1842725      FALSE
## 16         1    FALSE   0.4299752      FALSE
## 17         3    FALSE   0.1842725      FALSE
## 18         2    FALSE   0.2921823      FALSE
## 19         1     TRUE   0.4299752      FALSE
## 20         1    FALSE   0.4299752      FALSE

Interestingly, using the model with only study time as an explanatory variable, no predicted probability exceeds 0.5, and thus the model never predicts true (high alcohol consumption).

library(ggplot2)
g<-ggplot(alc, aes(x=probability, y=high_use, col=prediction))
g+geom_point()

loss_func<-function(class,prob){
  n_wrong<-abs(class-prob)>0.5
  mean(n_wrong)
}

loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2984293

The training error is approximately 0.3, meaning that 30 percent of students are misclassified in terms of their alcohol consumption (high/low). Guessing at random in a binary situation such as this gives a 50 percent error rate, so the model beats a coin flip; however, since the model never predicts true, its error is identical to that of the trivial strategy of always guessing low consumption. Thus, the performance of the model is no better than a simple majority-class guessing strategy.
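This baseline can be checked directly (assuming alc is loaded as above): the error of always predicting low consumption is simply the share of high-use students.

```r
# Baseline: always predict FALSE; the error is just the share of TRUE cases
# (sketch; assumes alc has been read in as above)
mean(alc$high_use)  # 114/382, about 0.298, identical to the training error above
```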

install.packages("boot")

library(boot)
cv<-cv.glm(data=alc, cost=loss_func, glmfit=model, K=10)
cv$delta[1]
## [1] 0.3141361
cv<-cv.glm(data=alc, cost=loss_func, glmfit=model2, K=10)
cv$delta[1]
## [1] 0.2984293

Both models have a higher average prediction error than the model used in the DataCamp exercises. To find a better-performing model, one could test different numbers and combinations of explanatory variables.
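Such a search can be sketched as a loop that cross-validates several candidate formulas (assuming alc and loss_func are defined as above; the extra predictors goout and sex are illustrative picks from the column list, not tested choices):

```r
library(boot)

# Compare candidate models by 10-fold CV error
# (sketch; assumes alc and loss_func are defined as above)
candidates <- list(high_use ~ studytime,
                   high_use ~ studytime + goout,
                   high_use ~ studytime + goout + sex)
sapply(candidates, function(f) {
  m <- glm(f, data = alc, family = "binomial")
  cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1]
})
```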


Clustering and Classification

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset Boston has 506 rows and 14 columns/variables. The variables describe factors that influence housing values in the suburbs of Boston.

install.packages("corrplot")

library(tidyr)
library(corrplot)
pairs(Boston)

cor_matrix<-cor(Boston)%>%round
cor_matrix
##         crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
## crim       1  0     0    0   0  0   0   0   1   1       0     0     0    0
## zn         0  1    -1    0  -1  0  -1   1   0   0       0     0     0    0
## indus      0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## chas       0  0     0    1   0  0   0   0   0   0       0     0     0    0
## nox        0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## rm         0  0     0    0   0  1   0   0   0   0       0     0    -1    1
## age        0 -1     1    0   1  0   1  -1   0   1       0     0     1    0
## dis        0  1    -1    0  -1  0  -1   1   0  -1       0     0     0    0
## rad        1  0     1    0   1  0   0   0   1   1       0     0     0    0
## tax        1  0     1    0   1  0   1  -1   1   1       0     0     1    0
## ptratio    0  0     0    0   0  0   0   0   0   0       1     0     0   -1
## black      0  0     0    0   0  0   0   0   0   0       0     1     0    0
## lstat      0  0     1    0   1 -1   1   0   0   1       0     0     1   -1
## medv       0  0     0    0   0  1   0   0   0   0      -1     0    -1    1
corrplot(cor_matrix, method="circle",type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6 )

According to the correlation plot, there appear to be strong relationships between a number of the variables in the dataset. For example, there is a strong positive relationship between per capita crime rates and property tax rates. There is a strong negative relationship between property tax rates and weighted distances to five Boston employment centers. Both of these relationships seem counterintuitive.

boston_scaled<-scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scaling standardized every variable: each column of boston_scaled now has a mean of 0 and unit variance, so both the values and the ranges shrank relative to the original data.
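What scale() does can be reproduced by hand: each column is centered on its mean and divided by its standard deviation.

```r
# Standardizing crim manually reproduces the corresponding scale() column
# (sketch; assumes the Boston data from MASS is loaded as above)
z <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z, as.vector(scale(Boston)[, "crim"]))  # TRUE
```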

boston_scaled<-as.data.frame(boston_scaled)
scaled_crim<-boston_scaled$crim
quants<-quantile(scaled_crim)
summary(quants)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  1.742000  0.007389  9.924000
string<-c("low", "med_low", "med_high", "high")
crime<-cut(scaled_crim, breaks=quants, include.lowest=TRUE, label=string)
summary(crime)
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled<-dplyr::select(boston_scaled, -crim)
boston_scaled<-data.frame(boston_scaled, crime)
n<-nrow(boston_scaled)
samp<-sample(n, size=n*0.8)
train<-boston_scaled[samp,]
test<-boston_scaled[-samp,]
lda.fit<-lda(crime~.,data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2400990 0.2376238 0.2549505 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.8885998 -0.9011685 -0.12651054 -0.8651577  0.43361013
## med_low  -0.1566000 -0.2888065  0.01179157 -0.5406093 -0.17857441
## med_high -0.3970195  0.2184795  0.13778554  0.4106588  0.03564202
## high     -0.4872402  1.0170891 -0.08120770  1.0454118 -0.48033094
##                 age        dis        rad        tax     ptratio
## low      -0.8695139  0.8665520 -0.6894375 -0.7292938 -0.42596927
## med_low  -0.2355538  0.2879135 -0.5509001 -0.4766305 -0.05231797
## med_high  0.4638692 -0.3910414 -0.4339567 -0.3119223 -0.15315640
## high      0.8046973 -0.8630158  1.6384176  1.5142626  0.78111358
##                black       lstat        medv
## low       0.37408443 -0.76101461  0.50635596
## med_low   0.31651592 -0.09374727 -0.01701815
## med_high  0.04108174  0.07675086  0.07197759
## high     -0.71226428  0.90701275 -0.70418858
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08150335  0.61623334 -0.95768361
## indus    0.04537990 -0.21365597  0.20903420
## chas    -0.11761334 -0.05013481  0.11248816
## nox      0.30203491 -0.74367359 -1.56110424
## rm      -0.13821916 -0.05915772 -0.23417902
## age      0.29592046 -0.36834064 -0.03840641
## dis     -0.05669083 -0.16030236 -0.01774081
## rad      3.42129172  0.97315017 -0.08034830
## tax      0.02549263  0.01035384  0.67604958
## ptratio  0.08061452 -0.04385477 -0.41161506
## black   -0.09242340  0.06197783  0.14752305
## lstat    0.17219088 -0.15568157  0.28561737
## medv     0.14994846 -0.32084803 -0.22701803
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9538 0.0342 0.0120
lda.arrows<-function(x, myscale=1, arrow_heads=0.1, color="red", tex=0.75, choices=c(1,2)){
  heads<-coef(x)
  arrows(x0=0, y0=0,
    x1=myscale*heads[,choices[1]],
    y1=myscale*heads[,choices[2]],
col=color, length=arrow_heads)
text(myscale*heads[,choices],labels=row.names(heads),
      cex=tex,col=color, pos=3)
}
classes<-as.numeric(train$crime)
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale=1)

Low, med_low and med_high training data are all intermixed and spatially tight, while the high training data are mostly clustered together, separate from the other categories and spatially distant (with the exception of a few med_high points intermixed and one high point mixed into the other cluster). It appears as though the high predictions will be the most accurate (most likely because its training data are the most spatially distant from the others), followed by the low predictions. The med_low and med_high predictions seem likely to be relatively inaccurate, and the categories could probably be slimmed to two or three classifications/clusters.

correct_classes<-test$crime
test<-dplyr::select(test,-crime)
lda.pred<-predict(lda.fit, newdata=test)
table(correct=correct_classes, predicted=lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       4        1    0
##   med_low   10      16        3    0
##   med_high   1       8       19    2
##   high       0       0        0   24

The LDA model was much more accurate at predicting the med_high and high results; it correctly predicted every high result. Low results were predicted correctly just over 50 percent of the time, although the incorrect predictions were almost all med_low. Med_low results were predicted correctly under 50 percent of the time, with predictions ranging from low to med_high, though med_low was predicted the most.
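The overall accuracy can be read off the cross-tabulation (a sketch, assuming lda.pred and correct_classes from the run above):

```r
# Share of correct predictions = sum of the diagonal over the total
# (sketch; assumes lda.pred and correct_classes as above)
mean(lda.pred$class == correct_classes)
# here (14 + 16 + 19 + 24) / 102, about 0.72
```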

library(MASS)
data("Boston")
boston_scaled<-scale(Boston)
dist_eu<-dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
km<-kmeans(dist_eu, centers=15)
pairs(Boston, col=km$cluster)

k_max<-15
twcss<-sapply(1:k_max, function(k){kmeans(dist_eu,k)$tot.withinss})
plot(1:k_max, twcss, type='b')

km2<-kmeans(dist_eu, centers=2)
pairs(Boston, col=km2$cluster)

Average number of rooms per dwelling (rm), nitrogen oxide concentration (nox) and proportion of owner-occupied units built prior to 1940 (age) seem to affect the clustering results.

library(MASS)
data("Boston")
boston_scaled<-scale(Boston)
dist_eu<-dist(boston_scaled)
km3<-kmeans(dist_eu, centers=4)
boston_scaled<-as.data.frame(boston_scaled)
lda.fit2<-lda(km3$cluster~.,data=boston_scaled)
lda.fit2
## Call:
## lda(km3$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2272727 0.4308300 0.2114625 0.1304348 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
## 2 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 3 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
## 4  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
##          age        dis        rad        tax     ptratio       black
## 1  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
## 2 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 3 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
## 4  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
##        lstat        medv
## 1  0.6116223 -0.54636549
## 2 -0.2813666 -0.01314324
## 3 -0.8200275  1.11919598
## 4  1.1930953 -0.81904111
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim    -0.18113078 -0.5012256 -0.60535205
## zn      -0.43297497 -1.0486194  0.67406151
## indus   -1.37753200  0.3016928  1.07034034
## chas     0.04307937 -0.7598229 -0.22448239
## nox     -1.04674638 -0.3861005 -0.33268952
## rm       0.14912869 -0.1510367  0.67942589
## age      0.09897424  0.0523110  0.26285587
## dis     -0.13139210 -0.1593367 -0.03487882
## rad     -0.65824136  0.5189795  0.48145070
## tax     -0.28903561 -0.5773959  0.10350513
## ptratio -0.22236843  0.1668597 -0.09181715
## black    0.42730704  0.5843973  0.89869354
## lstat   -0.24320629 -0.6197780 -0.01119242
## medv    -0.21961575 -0.9485829 -0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636
lda.arrows<-function(x, myscale=1, arrow_heads=0.1, color="red", tex=0.75, choices=c(1,2)){
  heads<-coef(x)
  arrows(x0=0, y0=0,
    x1=myscale*heads[,choices[1]],
    y1=myscale*heads[,choices[2]],
col=color, length=arrow_heads)
text(myscale*heads[,choices],labels=row.names(heads),
      cex=tex,col=color, pos=3)
}
classes=as.numeric(km3$cluster)
plot(lda.fit2, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit2, myscale=3)

Property tax rates (tax), the proportion of residential land zoned for large lots (zn) and nitrogen oxides concentration (nox) are the most influential linear separators for the clusters. These variables mostly influence the fourth cluster.


Dimensionality Reduction Techniques

human<-read.csv("create_human.csv",row.names=1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ edu_ratio  : num  97 94.5 95.8 96 89.1 ...
##  $ labor_ratio: num  65 65.3 68.3 62.5 64.5 ...
##  $ edu_exp    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ life_exp   : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI        : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MMR        : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ ABR        : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ fem_rep    : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

Human is a dataset with 155 rows (countries) and 8 columns (variables). The variables are key measures from the Human Development Index (HDI) and the Gender Inequality Index (GII). edu_ratio (mutated from the GII data) is the ratio of the female to male populations with secondary education in each country. labor_ratio (also mutated from the GII data) is the ratio of female to male labor force participation in each country. edu_exp (from the HDI) is the expected years of schooling of the general population. life_exp (from the HDI) is the life expectancy of the general population. GNI (HDI) is the Gross National Income per capita of a country in 2014. MMR (GII) is the maternal mortality ratio (deaths per 100,000 live births), and ABR (GII) is the adolescent birth rate (births per 1,000 women ages 15-19). fem_rep (GII) is the percentage of seats in parliament held by women.

install.packages("tidyr") install.packages("ggplot2") install.packages("GGally") install.packages("corrplot")

library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
ggpairs(human)

cor(human)%>%corrplot

Life expectancy and adolescent birth rate have a relatively strong negative correlation: as the adolescent birth rate increases, life expectancy decreases, and vice versa. Expected education and education ratio have a relatively strong positive correlation, indicating that the ratio of female to male populations with secondary education increases (and thus female educational inequality decreases) as expected education increases, and vice versa. GNI, MMR and ABR all have right-skewed distributions, while labor_ratio, edu_exp and fem_rep have relatively normal distributions. Edu_ratio and life_exp have left-skewed distributions.
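The specific correlations behind these observations can also be pulled out individually rather than read from the plot — a quick check, assuming `human` is loaded as above:

```r
# Pairwise correlations discussed above, rounded for readability
round(cor(human$life_exp, human$ABR), 2)      # life expectancy vs. adolescent birth rate
round(cor(human$edu_exp, human$edu_ratio), 2) # expected education vs. education ratio
```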

summary(human)
##    edu_ratio       labor_ratio       edu_exp         life_exp    
##  Min.   :  2.05   Min.   :40.90   Min.   : 5.40   Min.   :49.00  
##  1st Qu.: 33.27   1st Qu.:56.65   1st Qu.:11.25   1st Qu.:66.30  
##  Median : 56.95   Median :63.40   Median :13.50   Median :74.20  
##  Mean   : 58.03   Mean   :63.45   Mean   :13.18   Mean   :71.65  
##  3rd Qu.: 86.65   3rd Qu.:69.30   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :100.00   Max.   :89.15   Max.   :20.20   Max.   :83.50  
##       GNI              MMR              ABR            fem_rep     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
pca_human<-prcomp(human)
s<-summary(pca_human)
pca_pr<-round(100*s$importance[2,],digits=1)
pc_lab<-paste0(names(pca_pr),"(",pca_pr,"%)")
biplot(pca_human, cex=c(0.8,1), col=c("grey40","deeppink2"), xlab=pc_lab[1],ylab=pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

In the nonstandardized PCA, there appears to be no clear relationship among the original features or between the features and the principal components. The only exception is GNI, which exhibits a strong correlation with the first principal component. The first principal component explains essentially 100 percent of the variation in the data, because PCA on unscaled data is dominated by the variable with the largest variance, and GNI (measured in dollars) has by far the largest variance of the eight. Most of the countries appear clustered in the middle of the biplot with little visible variation, which is why the original features seem to have little relation to the principal components or to each other.
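The dominance of GNI is easy to verify from the raw variances — a quick sketch, again assuming `human` is loaded:

```r
# Variances of the unscaled variables; GNI is orders of magnitude
# larger than the rest, so it dominates an unstandardized PCA
sort(sapply(human, var), decreasing = TRUE)
```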

human_std<-scale(human)
summary(human_std)
##    edu_ratio         labor_ratio           edu_exp       
##  Min.   :-1.91047   Min.   :-2.200017   Min.   :-2.7378  
##  1st Qu.:-0.84475   1st Qu.:-0.663486   1st Qu.:-0.6782  
##  Median :-0.03672   Median :-0.004972   Median : 0.1140  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.97695   3rd Qu.: 0.570617   3rd Qu.: 0.7126  
##  Max.   : 1.43259   Max.   : 2.507135   Max.   : 2.4730  
##     life_exp            GNI               MMR               ABR         
##  Min.   :-2.7188   Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325  
##  1st Qu.:-0.6425   1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394  
##  Median : 0.3056   Median :-0.3013   Median :-0.4726   Median :-0.3298  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6717   3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030  
##  Max.   : 1.4218   Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344  
##     fem_rep       
##  Min.   :-1.8203  
##  1st Qu.:-0.7409  
##  Median :-0.1403  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6127  
##  Max.   : 3.1850
pca_human_std<-prcomp(human_std)

s2<-summary(pca_human_std)
pca_pr2<-round(100*s2$importance[2,],digits=1)
pc_lab2<-paste0(names(pca_pr2),"(",pca_pr2,"%)")
biplot(pca_human_std, cex=c(0.8,1), col=c("grey40","deeppink2"), xlab=pc_lab2[1],ylab=pc_lab2[2])

summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.1585 1.0679 0.88106 0.68579 0.59186 0.53987
## Proportion of Variance 0.5824 0.1426 0.09703 0.05879 0.04379 0.03643
## Cumulative Proportion  0.5824 0.7249 0.82195 0.88074 0.92453 0.96096
##                            PC7     PC8
## Standard deviation     0.44297 0.34071
## Proportion of Variance 0.02453 0.01451
## Cumulative Proportion  0.98549 1.00000

In the standardized PCA, the correlations are more evident and the countries are more spread out, showing greater variation. Standardizing the variables puts them on a common scale, so no single variable dominates and the variability in the data becomes visible.

Female representation in parliament and the ratio of labor force participation between females and males are relatively correlated with each other and with principal component 2 (fem_rep has the stronger correlation with PC2), indicating that PC2 most likely explains the variation in the country data relating to gender inequality. However, PC2 captures only 14 percent of the variability in the data. The rest of the variables are strongly correlated with principal component 1 and with each other, although in different directions. Maternal mortality ratio and adolescent birth rate are strongly correlated with each other and with PC1. Gross National Income per capita, the ratio of female to male populations with secondary education, expected years of education, and life expectancy are strongly correlated with each other and with PC1, but in the opposite direction of MMR and ABR. It is likely that these variables explain the human (or economic) development variability between the countries in the data. High ABR and MMR contribute negatively to human development, while higher GNI, expected education, life expectancy and levels of female secondary education point to more economically developed countries. Labor_ratio seems to be roughly equally correlated with PC1 and PC2, which makes sense because higher female participation in the labor force indicates both lower gender inequality and higher economic development. PC1 explains 58.2 percent of the variability in the data.

Overall, PC1 and PC2 indicate the influence of different groups of original features on the data. However, PC1 has significantly more explanatory power and is highly correlated with more of the original features. The features loading on PC1 are related to human/economic development, so PC1 describes variation in the data driven by human/economic development. PC2 is more related to variation between countries in terms of gender inequality and captures less of the variation, simply because more of the original features come from the HDI. Even variables from the GII have a relationship with PC1, because development is related to gender equality.
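The directions described above can be read off the loadings of the standardized PCA — assuming the `pca_human_std` object from above:

```r
# Loadings (rotation matrix) of the first two principal components;
# the sign pattern shows which variables pull in opposite directions
round(pca_human_std$rotation[, 1:2], 2)
```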

install.packages("FactoMineR") install.packages("dplyr")

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.3.2
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
library(dplyr)
keep_columns<-c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time<-dplyr::select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time)%>%ggplot(aes(value))+geom_bar()+facet_wrap("key", scales="free")+theme(axis.text.x=element_text(angle=45,hjust=1,size=8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

mca<-MCA(tea_time, graph=FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")

summary(tea$where)
##          chain store chain store+tea shop             tea shop 
##                  192                   78                   30
summary(tea$how)
##            tea bag tea bag+unpackaged         unpackaged 
##                170                 94                 36

According to the MCA summary, how and where contribute substantially to dimension 1, and more than the other variables to dimension 2.

Dimension 1 explains 15 percent of the variance in the data, while dimension 2 explains 14 percent. In the factor map, many of the categories cluster together, with the exception of other, chain store+tea shop, tea bag+unpackaged, green, unpackaged and tea shop. Most of these outlying categories belong to the variables how and where, indicating their influence on the first two dimensions and on the variation in the data as a whole.
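Which categories drive each dimension can be confirmed numerically from the MCA output rather than from the factor map alone — a sketch assuming the `mca` object from above:

```r
# Category contributions (in percent) to the first MCA dimension;
# categories of `how` and `where` should top the list
sort(mca$var$contrib[, 1], decreasing = TRUE)
```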